Annotating COMPARA, a Grammar-aware Parallel Corpus

Authors

  • Diana Santos
  • Susana Inácio
Abstract

In this paper we describe the annotation of COMPARA, currently the largest post-edited parallel corpus that includes Portuguese. We describe the motivation, the results so far, and the way the corpus is being annotated. We also provide the first grounded results about syntactical ambiguity in Portuguese. Finally, we discuss some interesting problems in this connection.

Similar resources

Name-aware Machine Translation

We propose a Name-aware Machine Translation (MT) approach that tightly integrates name processing into the MT model by jointly annotating parallel corpora, extracting name-aware translation grammar and rules, and adding a name phrase table and name-translation-driven decoding. We also propose a new MT metric to appropriately evaluate the translation quality of informative words, by ass...

Syntactical Annotation of COMPARA: Workflow and First Results

In this paper we present the annotation of COMPARA, currently the largest parallel corpus that includes Portuguese. We describe the motivation and the way the corpus is being annotated, give a glimpse of the results so far, and mention some studies based on it.

What's in a Colour? Studying and Contrasting Colours with COMPARA

In this paper we present contrastive colour studies done using COMPARA, the largest edited parallel corpus in the world (as far as we know). The studies were the result of semantic annotation of the corpus in this domain. We chose to start with colour because it is a relatively contained lexical category and the subject of many arguments in linguistics. We begin by explaining the criteria invol...

Exploiting Parallel Corpora for Supervised Word-Sense Disambiguation in English-Hungarian Machine Translation

In this paper we present an experiment to automatically generate annotated training corpora for a supervised word sense disambiguation module operating in an English-Hungarian and a Hungarian-English machine translation system. Training examples for the WSD module are produced by annotating ambiguous lexical items in the source language (words having several possible translations) with their pr...

Automatic Extraction of Tagset Mappings from Parallel-Annotated Corpora

Several research projects around the world are building grammatically analysed corpora; that is, collections of text annotated with part-of-speech wordtags and syntax trees. However, projects have used quite different wordtagging and parsing schemes. Developers of corpora adhere to a variety of competing models or theories of grammar and parsing, with the effect of restricting the accessibility...

Publication date: 2006